{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd \n",
    "\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.gridspec as gridspec\n",
    "%matplotlib inline\n",
    "import seaborn as sns\n",
    "plt.style.use('seaborn-talk')\n",
    "plt.style.use('bmh')\n",
    "plt.rcParams['font.weight'] = 'medium'\n",
    "#plt.rcParams['figure.figsize'] = 10,7\n",
    "blue, green, red, purple, gold, teal = sns.color_palette('colorblind', 6)\n",
    "\n",
    "import os\n",
    "print(f\"Current dir: {os.getcwd()}\")\n",
    "os.chdir('..')\n",
    "print(f\"Current dir: {os.getcwd()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 8: Feature Importance\n",
    "___\n",
    "\n",
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8.1** Using the code presented in Section 8.6:\n",
    "- **(a)** Generate a dataset $(X, y)$.\n",
    "- **(b)** Apply a PCA transformation on $X$, which we denote $\\dot{X}$.\n",
    "- **(c)** Compute MDI, MDA, and SFI feature importance on $(\\dot{X}, y)$, where the base estimator is RF.\n",
    "- **(d)** Do the three methods agree on what features are important? Why?\n",
    "\n",
    "**8.2** From exercise 1, generate a new dataset  $(\\ddot{X}, y)$, where $\\ddot{X}$ is a feature union of X and Ẋ.\n",
    "- **(a)** Compute MDI, MDA, and SFI feature importance on $(\\ddot{X}, y)$, where the base estimator is RF.\n",
    "- **(b)** Do the three methods agree on the important features? Why?\n",
    "\n",
    "**8.3** Take the results from exercise 2:\n",
    "- **(a)** Drop the most important features according to each method, resulting in a features matrix $\\dddot{X}$.\n",
    "- **(b)** Compute MDI, MDA, and SFI feature importance on $(\\dddot{X}, y)$, where the base estimator is RF.\n",
    "- **(c)** Do you appreciate significant changes in the rankings of important features, relative to the results from exercise 2?\n",
    "\n",
    "**8.4** Using the code presented in Section 8.6:\n",
    "- **(a)**  Generate a dataset $(X, y)$ of 1E6 observations, where 5 features are informa-\n",
    "tive, 5 are redundant and 10 are noise.\n",
    "- **(b)**  Split (X, y) into 10 datasets $\\{(X_i, y_i)\\}_{i=1,...,10}$ each of 1E5 observations.\n",
    "- **(c)**  Compute the parallelized feature importance (Section8.5),on each of the 10 datasets, $\\{(X_i, y_i)\\}_{i=1,...,10}$.\n",
    "- **(d)**  Compute the stacked feature importance on the combined dataset $(X, y)$.\n",
    "- **(r)**  What causes the discrepancy between the two? Which one is more reliable?\n",
    "\n",
    "**8.5** Repeat all MDI calculations from exercises 1–4, but this time allow for masking effects. That means, do not set max_features=int(1) in Snippet 8.2. How do results differ as a consequence of this change? Why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.snippets.ch8 import getTestData"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "My own SNIPPETS 8.8, 8.9 and 8.10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_features=40 \n",
    "n_informative=10\n",
    "n_redundant=10\n",
    "n_estimators=1000\n",
    "n_samples=10000\n",
    "cv = 10\n",
    "\n",
    "trnsX,cont=getTestData(n_features,n_informative,n_redundant,n_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.snippets.ch8 import featImpMDI\n",
    "from src.snippets.ch8 import featImpMDA\n",
    "from src.snippets.ch8 import auxFeatImpSFI\n",
    "\n",
    "from src.snippets.ch7 import PurgedKFold\n",
    "from src.snippets.ch7 import cvScore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import BaggingClassifier\n",
    "\n",
    "def featImportance_no_para(trnsX, cont, n_estimators=1000, cv=10, max_samples=1., numThreads=24,\n",
    "                   pctEmbargo=0, scoring='accuracy', method='SFI', minWLeaf=0., **kargs):\n",
    "\n",
    "    n_jobs = (-1 if numThreads > 1 else 1)\n",
    "    #1) prepare classifier,cv. max_features=1, to prevent masking\n",
    "    clf = DecisionTreeClassifier(criterion='entropy', max_features=1,\n",
    "                                 class_weight='balanced', min_weight_fraction_leaf=minWLeaf)\n",
    "    clf = BaggingClassifier(base_estimator=clf, n_estimators=n_estimators,\n",
    "                            max_features=1., max_samples=max_samples,\n",
    "                            oob_score=True, n_jobs=n_jobs)\n",
    "    fit = clf.fit(X=trnsX, y=cont['bin'], sample_weight=cont['w'].values)\n",
    "    oob = fit.oob_score_\n",
    "    \n",
    "    print(f'oob: {oob} , {clf.oob_score_}')\n",
    "    print(f\"score: {clf.score(X=trnsX, y=cont['bin'])}\")\n",
    "    \n",
    "    if method == 'MDI':\n",
    "        imp = featImpMDI(fit, featNames=trnsX.columns)\n",
    "        oos = cvScore(clf, X=trnsX, y=cont['bin'], cv=cv, sample_weight=cont['w'],\n",
    "                      t1=cont['t1'], pctEmbargo=pctEmbargo, scoring=scoring)\n",
    "        print(f'fold acc: {oos}')\n",
    "        oos = oos.mean()\n",
    "    elif method == 'MDA':\n",
    "        imp, oos = featImpMDA(clf, X=trnsX, y=cont['bin'], cv=cv, sample_weight=cont['w'],\n",
    "                              t1=cont['t1'], pctEmbargo=pctEmbargo, scoring=scoring)\n",
    "    elif method == 'SFI':\n",
    "        \n",
    "        cvGen = PurgedKFold(n = trnsX.shape[0],\n",
    "                            n_folds=cv, \n",
    "                            t1=cont['t1'], \n",
    "                            pctEmbargo=pctEmbargo)\n",
    "        \n",
    "        oos = cvScore(clf, \n",
    "                      X=trnsX, \n",
    "                      y=cont['bin'], \n",
    "                      sample_weight=cont['w'], \n",
    "                      scoring=scoring, \n",
    "                      cvGen=cvGen).mean()\n",
    "        \n",
    "        clf.n_jobs = 1  # paralellize auxFeatImpSFI rather than clf\n",
    "        imp = auxFeatImpSFI(trnsX.columns, clf=clf, trnsX=trnsX, cont=cont, scoring=scoring, cvGen=cvGen) \n",
    "    return imp, oob, oos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plotFeatImportance(imp, oob, oos, method, tag=0, simNum=0, **kargs):\n",
    "    '''\n",
    "    SNIPPET 8.10 FEATURE IMPORTANCE PLOTTING FUNCTION\n",
    "    plot mean imp bars with std\n",
    "    '''\n",
    "    plt.figure(figsize=(10, imp.shape[0]/5.))\n",
    "    imp = imp.sort_values('mean', ascending=True)\n",
    "    ax = imp['mean'].plot(kind='barh', color='b',\n",
    "                          alpha=.25, xerr=imp['std'],\n",
    "                          error_kw={'ecolor': 'r'})\n",
    "    if method == 'MDI':\n",
    "        plt.xlim([0, imp.sum(axis=1).max()])\n",
    "        plt.axvline(1./imp.shape[0], linewidth=1,\n",
    "                    color='r', linestyle='dotted')\n",
    "    ax.get_yaxis().set_visible(False)\n",
    "    for i, j in zip(ax.patches, imp.index):\n",
    "        ax.text(i.get_width()/2,\n",
    "                i.get_y()+i.get_height()/2, j, ha='center', va='center',\n",
    "                color='black')\n",
    "    plt.title(f'tag={tag} | simNum={simNum} | oob={oob:.{2}} | oos={oos:.{2}}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "method = 'MDI'\n",
    "imp,oob,oos = featImportance_no_para(trnsX=trnsX,cont=cont, method = method)\n",
    "plotFeatImportance(imp=imp,oob=oob,oos=oos, method = method)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "method = 'MDA'\n",
    "imp,oob,oos = featImportance_no_para(trnsX=trnsX,cont=cont, method = method)\n",
    "plotFeatImportance(imp=imp,oob=oob,oos=oos, method = method)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "method = 'SFI'\n",
    "imp,oob,oos = featImportance_no_para(trnsX=trnsX,cont=cont, method = method)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plotFeatImportance(imp=imp,oob=oob,oos=oos, method = method)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
